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Introduction 


Summary 


@ Introduction 
e Short range N-body simulation 
@ Overview of the simulation 


Introduction 
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Molecular dynamics (MD) 


o 


computer simulation of a system of particles; 
N-body problem (cut-off distance): 
e forces are neglected if dist(partl, part2) > re. 


o 


simulate hundreds of millions of particles; 


verify simulation results with real experiments (physicist). 


use multiple accelerators on a single node: 
integrate the simulation to ExaStamp (CEA): 
e a parallel framework for MD on heterogeneous clusters. 


@ 
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Introduction 


Overview of the simulation 


Figure : Overview of the interactive simulation (OpenGL + OpenCL app) 
with around 2 million particles 


Background 


Summary 


@ Background 
e OpenCL programming model 
e NVIDIA GPU execution model 
e Intel Xeon Phi execution model 
e OpenCL best practices 
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OpenCL pi 


What is OpenCL ? 


@ a standard for parallel programming of heterogeneous systems: 


@ initially influenced by GPU execution models: 


@ but now available on different architectures, including CPUs. 


OpenCL portability 
@ the performance portability is not always guaranteed: 
e because there are different HW designs (GPUs, CPUs, etc). 


Do you need to have different optimizations for different devices ? | 
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OpenCL programming m 


e Device - GPU, CPU, etc.: 
@ Work-item - Thread: 


Work-group - Group of work-items; 


o 


o 


Memory spaces: 
e Private - Work-item memory; 
e Local - Memory shared by work-items in a work-group; 
e Global - Memory shared by all work-items; 
e Constant - Read-only global memory. 


OpenCL Runtime 


e Device creation; 
e Buffer management; 
e Kernel dispatch. 
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Background 
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OpenCL programming model 


ScalVec kernel example 


@ vector vec is located in global memory; 


@ one work-item per vector element is used. 


_ kernel void Scalvec( global float “vec, float k) 
{ 
int index = get_global_id(0); 


vec[index] *= k; 


} 
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Figure : ScalVec kernel 
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| Tejo) 


NVIDIA GPU execution model 


Streaming processor (SP) 


e interleaved execution of 
sequential hardware threads; 


hardware AE 


thread — 
e context switch is free (avoid 


SA SA A A 


stalling on memory load). $ | m 
- - streaming h 
Streaming multiprocessor (SM) processor 
e hosts groups of hardware double 
threads; precision 
unit 


@ local memory sharing and 


synchronization. Figure : Cluster of SIMD units 


Global memory is shared by all streaming multiprocessors 


Background 
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NVIDIA GPU execution model 


Streaming multiprocessor 
e several OpenCL work-groups 
can reside on the same SM; 


e limited by hardware 
resources: 
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e registers; 
e local memory; 
e max HW threads per SP. 


Shared local memory 


@ much faster than global 
memory (shared by all SMs); 


e only a few kBytes! Figure : Streaming Multiprocessor 
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Xeon Phi & OpenCL 
e 61 cores, 244 threads (4x threads interleaved); 


e driver creates 240 SW threads which are pinned on each core: 
e threads scheduling in software (overhead). 


e each work-group is executed sequentially by one thread. 


Implicit vectorization 


e kernels are implicitly vectorized along dimension 0: 


e vector size of 16 elements. 


__Kernel void foo(...) 
For (int i = 0; i < get_local_size(2); i++) 
For (int j = O; j < get_local_size(1); j++) 
For (int k = 0; k < get local size(O): k += VECTOR_SIZE) 
Vectorized Kernel Body, 
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Background 
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e use tiling in local shared memory (much faster); 
@ memory accesses must be coalesced whenever possible; 


@ avoid different execution paths inside the same WG. 


Intel Xeon Phi 


@ do not use local memory and avoid barriers: 


e no physical scratchpad local memory; 
e no special HW support, so barriers are emulated by OpenCL. 


@ code divergence may prevent successful vectorization: 


@ limit the number of kernels (software scheduling overhead). 


Contributions 


Summary 


@ Contributions 
@ Multi accelerators strategy 
e Distribute the work 
@ Transfer of particles 
@ Overlap memory accesses 
e Parallelization strategy 
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Contributions 
e 


Initial version 
e single accelerator version for NVIDIA GPUs; 
e developed by Raymond Namyst. 


Objectives 
e use multiple accelerators on a single node; 
e distribute the work among accelerators; 
e transfer particles between accelerators whenever it's needed: 
e to maintain physical properties (cf. cut-off distance). 


e overlap memory accesses and optimize OpenCL code. 
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Contributions 


Distribute the work 


How to split the 3D space ? 


@ spatial decomposition at the initialization: 


@ global domain splitted in Z plans of size rc (cut-off distance). 


global domain Z plan (size r.) particles 


Figure : 2D overview of the spatial decomposition with 3 sub-domains 
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Contributions 
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Transfer of particles 


Borders management 


@ duplicate borders to maintain physical properties; 
e a border is a Z plan with "ghost particles": 


e "ghost particles" belong to a close sub-domain. 


Z borders of node 0 ghost particles 


borders of node 1 


Figure : Exploded view of borders duplication with "ghosts particles" 
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Contributions 
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Transfer of particles 


Particles out-of-domain 


@ particles move during the simulation: 
@ a particle can move from a sub-domain to another one; 


@ need to transfer these particles after each iteration. 


Figure : At the next step, the red particle will belong to the node 1, and 
the blue particle will belong to the node 0 
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Contributions 
e 


Overlap memory accesses 


Overlap memory accesses with HW computation 


@ parallel decomposition of the problem: 


e left and right borders are processed before the center; 
e allows to transfer borders while the center is processing. 


i left center right 1 


Figure : Parallel decomposition : left and right borders are processed 
before the center to allow to transfer borders 
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Contributions 
e 


Parallelization strategy 


Important points 


e the most costly kernel: 


@ one thread per particle: 


@ 27 cells to compute forces 
with neighbors: 
e particles sorted at each 
iteration; 
e coalesced accesses along 
X axis. 


@ two implementations (GPU 
& CPU/MIC): 

e for performance & code 
readability. 


Figure : Computation of forces with 
neighbors (27 cells) 
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Contributions 
e 


e global domain needs to be homogeneous (static distribution); 


@ the slowest compute node slows down all others. 


Discussion : load balancing 


@ idea: use a supervised learning based on execution times; 
@ profile performance of compute nodes; 


e transfer Z plans between accelerators. 
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Evaluation 


Summary 


@ Evaluation 
e Single accelerator 
@ Multi accelerators 
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Evaluation 
e 


Single accelerator 
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Figure : Time in microseconds for one iteration with one million particles 
in simple and double precision 


Evaluation 
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Multi accelerators 


70 
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Throughput (#Mparticles/i/s) 


Number of GPUs 


Figure : Throughput according to the number of GPUs (3xTesla M2075), 
in simple precision with around one million particles on each GPU 


Conclusion 


Summary 


@ Conclusion 
e Questions & Discussions 
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Conclusion 
e 


Conclusion 


Current status 


@ more than 90M particles on accelerators with 5GB RAM: 
e single precision performance results: 
e 61 Mparticles/i/s with 3xNVIDIA Tesla M2075 (gain: 2.9). 


e works quite well with NVIDIA GPUs and Intel Xeon Phi. 


Much potential (and ideas) for improvement 


e load balancing between accelerators; 
@ some optimizations are still applicable on Xeon Phi; 


@ OpenCL kernels differ from one architecture to another: 
e OpenCL 2.0 could be a good start! 
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Questions & Discussions 


Questions & Discussions 


Conclusion 
e 


